{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Accessing data in CDP with the SDK.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/bradleywhitlock/CDP-Exploration/blob/master/Accessing_data_in_CDP_with_the_SDK.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "metadata": {
        "id": "rHbgiH0OPPTY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# [Tour of the Cognite Data Platform using the SDK](https://doc.cognitedata.com)\n",
        "*January 2019*\n",
        "\n",
        ">This notebook provides a quick reference for common access patterns to the Cognite Data Platform via the [python SDK](https://github.com/cognitedata/cognite-sdk-python).\n",
        "For a detailed explanation of concepts please see [our documentation homepage](https://doc.cognitedata.com).\n",
        "\n",
        "<hr>"
      ]
    },
    {
      "metadata": {
        "id": "Kme2YxHMPPTa",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Prerequisites\n",
        "\n",
        "- Install dependencies, including the `cognite-sdk`, with either `pip` (requirements.txt) or `pipenv` (PipFile) or use Google Collab\n",
        "- Make sure that you have received an API key from http://openindustrialdata.com/ \n",
        "- Set the API key as an environment variable `PUBLICDATA_API_KEY`, or paste it into the prompt.\n"
      ]
    },
    {
      "metadata": {
        "id": "6VaVx0WMPZE7",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!pip install cognite-sdk"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "rxbmmYglPPTb",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "%matplotlib inline\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "pd.set_option('display.max_rows', 10)\n",
        "import os\n",
        "\n",
        "import datetime\n",
        "\n",
        "from cognite.client import CogniteClient"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "NN_Es6w4PPTd",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# 1.  [Projects](https://doc.cognitedata.com/concepts/#projects) and Authentication with API keys\n",
        "\n",
        "Projects are the top level organization in CDP. Each customer normally has a single project.\n",
        "\n",
        "Authentication to CDP is done using API keys.\n",
        "Keys are granted different permission levels.\n",
        "For Open Industrial Data in this tutorial we only provide read access. Best practice is to use read only keys in exploratory analysis and development and keys with write access only when deployment."
      ]
    },
    {
      "metadata": {
        "id": "O7G1vBPiPPTf",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from getpass import getpass\n",
        "API_KEY = os.environ['PUBLICDATA_API_KEY'] if 'PUBLICDATA_API_KEY' in os.environ else getpass(\"Enter API key:\")\n",
        "client = CogniteClient(api_key=API_KEY, project=\"publicdata\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "aAR244wDPPTh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# 2.  [Assets](https://doc.cognitedata.com/concepts/#assets)\n",
        "\n",
        "The data in the Cognite Data Platform is structured by assets, where an asset can be a specific piece of equipment or an equipment type.\n",
        "\n",
        "### Discovering assets\n",
        "To discover assets in the hierarchy, `clients.assets.search_for_assets` provides text search, but it is normally useful to load the entire hierarchy with `client.assets.get_assets()` and take advantage of pandas' filtering.\n",
        "When downloading large hierarchies, be sure to set `autopaging=True` to get all the records."
      ]
    },
    {
      "metadata": {
        "id": "YwgGVOFWPPTi",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# when you know the equipment, fuzzy search through names\n",
        "client.assets.search_for_assets(name='23-KA-9101xxx', limit=5).to_pandas()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "7mDIRrXzPPTk",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# or even wild card search through the asset descriptions\n",
        "client.assets.search_for_assets(query=\"lube oil\", limit=5).to_pandas()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "scrolled": false,
        "id": "Cv4-6IWFPPTn",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# fetch all assets\n",
        "df_assets = client.assets.get_assets().to_pandas()\n",
        "\n",
        "# fetch the highest level assets that contain the word compressor\n",
        "df_assets[df_assets.description.str.lower().str.contains(\"compressor\")].sort_values('depth').head(5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "H8hRxs5OPPTo",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Traversing the asset hierarchy\n",
        "Once you know the asset Id, then there are further options for traversing the hierarchy.\n",
        "\n",
        "The `get_assets()` function maps to the low level API functions. The following parameters are useful:\n",
        " - `path`: Fetches all assets below a certain node in the hierarchy. Passes directly to the API and therefore requires a json list of ids (i.e. `\"[1,2,3]\"`).\n",
        " - `depth`: How many levels down the asset hierarchy to traverse\n",
        " \n",
        "The above parameters are useful to understand because they appear throughout the API, but more intuitive access to subtrees is available with `client.assets.get_asset_subtree`.\n"
      ]
    },
    {
      "metadata": {
        "id": "HOXZBSlDPPTp",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# Display the hierarchy below the first stage compressor node\n",
        "dcols = [\"depth\", \"description\", \"id\", \"name\", \"parentId\", \"path\"]\n",
        "FIRST_STAGE_COMPRESSOR_ID = 4518112062673878\n",
        "df_asset_sample = client.assets.get_asset_subtree(FIRST_STAGE_COMPRESSOR_ID, depth=1)\\\n",
        "    .to_pandas()\\\n",
        "    .sort_values(by=[\"depth\"])\n",
        "\n",
        "df_asset_sample.head(5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "sKr5gOVrPPTq",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Tip:** The hierarchy returns useful information about hierarchy in the near vicinity.\n",
        "`parentId` gives the id of the immediate parent, and `path` gives the nodes up to the root node of the project.\n",
        "This is useful for grabbing assets upwards in the hierarchy:"
      ]
    },
    {
      "metadata": {
        "id": "-jp9_WPDPPTs",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# show the lineage of an asset using path\n",
        "lineage = df_asset_sample.loc[0, 'path']\n",
        "print(\"path: \", lineage)\n",
        "\n",
        "# show the uncles and aunts of the first stage compressor\n",
        "client.assets.get_asset_subtree(lineage[-2], depth=1).to_pandas().sort_values(\"depth\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "xorUHuNhPPTu",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# 3. [Timeseries](https://doc.cognitedata.com/concepts/#time-series)\n",
        "\n",
        "A time series consists of a sequence of data points connected to a single asset. Here we take a quick look at some common access patterns.\n",
        "\n",
        "### Discovering timeseries\n",
        "Similar parameters from the asset API are available for discovering timeseries by asset. `client.time_series.get_time_series` can metadata on timeseries below a certain node in the hierarchy with the following notable arguments:\n",
        " - `path`: Fetches all assets below a certain node in the hierarchy. Passes directly to the API and therefore requires a json list of ids (i.e. `\"[1,2,3]\"`).\n",
        " - `depth`: How many levels down the asset hierarchy to traverse\n",
        " \n",
        "Again, if a large number of results are expected, the `autopaging` parameter is recommended."
      ]
    },
    {
      "metadata": {
        "id": "V3s_BWZoPPTv",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "#Get all time series of a particular asset\n",
        "timeseries_metadata = client.time_series.get_time_series(path=str([FIRST_STAGE_COMPRESSOR_ID])).to_pandas()\n",
        "timeseries_metadata"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "gcruim8cPPTx",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# again, pandas functionality is hard to beat for finding the timeseries you need\n",
        "scrubber_timeseries = timeseries_metadata[\n",
        "    timeseries_metadata['description'].str.lower().str.contains('scrubber') &\\\n",
        "    timeseries_metadata['name'].str.contains('Value') \n",
        "    ].reset_index(drop=True)\n",
        "scrubber_timeseries"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "CvLdLfqwPPTz",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "scrubber_timeseries.name.tolist()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "QcmM0hs7PPT1",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "### Downloading timeseries data with [datapoints](https://doc.cognitedata.com/concepts/#data-points)\n",
        "\n",
        "One of the strengths of CDP is the expressiveness of timeseries downloads. The `datapoints` client provides access timeseries data in two ways:\n",
        "- `client.datapoints.get_datapoints` for a single timeseries, capable of downloading raw datapoints\n",
        "- `client.datapoints.get_datapoints_frame` for multiple timeseries clocked to a common time axis\n",
        "\n",
        "For both APIs, the following parameters are important to understand:\n",
        "- `start` and `end` can be specified as python `datetime` objects, milliseconds since epoch UTC or with `timeunits`.  Time units ; `N[timeunit]-ago` where `timeunit` is w,d,h,m,s \n",
        "- `granularity` specifies the aggregation windows using `timeunits`, e.g. `10m`\n",
        "- `aggregates` specifies the list of aggregate functions you wish to apply to the data. Valid aggregate functions are: 'average/avg, max, min, count, sum, interpolation/int, stepinterpolation/step'.\n",
        "\n",
        "The `granularity` parameter standardizes the time axis of the data returned from CDP, and the `aggregation` parameter sets the up- or downsampling algorithm. Providing neither of these parameters to the `get_datapoints` function returns raw datapoints.\n",
        "\n",
        "**Tip:** the timestamps in CDP are in units of milliseconds since epoch time. The easiest way to transform this to a python `datetime` is to use `pd.to_datetime(df.timestamp, unit='ms')`.\n",
        "\n",
        "#### Raw datapoints"
      ]
    },
    {
      "metadata": {
        "id": "ccRjgM_QPPT1",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# Note that: openindustrialdata.com has a lag of 1 week for data security\n",
        "\n",
        "# Fetch an hour of raw data\n",
        "df_raw = client.datapoints.get_datapoints(\n",
        "    name=scrubber_timeseries.loc[0, 'name'],\n",
        "    start=datetime.datetime.now() - datetime.timedelta(hours=8*24 + 1),\n",
        "    end=datetime.datetime.now() - datetime.timedelta(hours=8*24),\n",
        ").to_pandas()\n",
        "\n",
        "df_raw.set_index(pd.to_datetime(df_raw.timestamp, unit='ms'))['value'].plot()\n",
        "df_raw"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "Wqbq-GxzPPT5",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Resampled dataframes\n",
        "Tabular structures serve as a good entry point for downstream analysis. From these structures we can proceed to do data quality, exploratory data analysis and model development. Of course, for more complex feature engineering, we often need to use the raw datapoints using the datapoints approach shown above."
      ]
    },
    {
      "metadata": {
        "id": "hcK0WHU0PPT5",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "df_frame = client.datapoints.get_datapoints_frame(\n",
        "    time_series=scrubber_timeseries['name'].tolist(),\n",
        "    start='14d-ago',\n",
        "    end='7d-ago',\n",
        "    granularity='1h',\n",
        "    aggregates=['avg'],\n",
        ")\n",
        "\n",
        "df_frame = df_frame.set_index(pd.to_datetime(df_frame.timestamp, unit='ms')).drop('timestamp', axis=1)\n",
        "df_frame"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "A5suBGf4PPT7",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "from sklearn.preprocessing import QuantileTransformer\n",
        "pd.DataFrame(\n",
        "    data=QuantileTransformer(output_distribution='normal', n_quantiles=100).fit_transform(df_frame),\n",
        "    columns=df_frame.columns,\n",
        "    index=df_frame.index\n",
        ").plot(legend=False)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "ofnWS-0BPPT-",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Other useful datapoint access patterns\n"
      ]
    },
    {
      "metadata": {
        "id": "hGnxTPELPPT_",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# get the latest datapoint\n",
        "client.datapoints.get_latest(scrubber_timeseries.loc[0, 'name']).to_pandas()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "tswc7H9UPPUB",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "# listening for streaming data\n",
        "for dp in client.datapoints.live_data_generator(scrubber_timeseries.loc[0, 'name']):\n",
        "    print(dp)"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
